1.2 First steps

Do penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? How about by the island where the penguin lives? Let’s create visualizations that we can use to answer these questions.

Load libraries

Load the required libraries.

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

View the data

head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

Build a scatterplot

# ggplot function defines data source and global mapping attributes
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  # geom functions define the plot type and local mapping attributes
  geom_point(
    mapping = aes(
      colour = species,
      shape = species
    )
  ) +
  geom_smooth(method = "lm") +
  # labs adds labels
  labs(
    title = "Flipper length and body mass",
    subtitle = "Dimensions for Adelie, Chinstrap and Gentoo species",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    shape = "Species",
    colour = "Species"
  ) +
  # colour theme from ggthemes package
  scale_color_colorblind()

Exercises

  1. How many rows are in penguins? How many columns?
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
  1. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
?penguins
Snippet of the penguins help file showing the definition for bill_depth_mm
  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(
  penguins,
  aes(
    x = bill_length_mm,
    y = bill_depth_mm,
    colour = species
  )
) +
  geom_point()

If you drop the colour mapping, the points form a diffuse cloud with little visible overall relationship between bill length and bill depth. With species mapped to colour, however, each species shows a positive relationship: as bill length increases, so does bill depth.
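
One way to check this visual impression is to compare the correlation across all penguins with the correlation within each species. This is a minimal sketch that assumes Pearson correlation is an adequate summary here; the names overall and by_species are introduced only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, ignoring species
overall <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)

# Correlation within each species
by_species <- penguins |>
  group_by(species) |>
  summarise(
    r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
  )

overall
by_species
```

If the within-species values are positive while the overall value is not, the species grouping is what hides the relationship in the uncoloured plot.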

  1. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?
ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = species
  )
) +
  geom_point()

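Because species is categorical, the points line up in one horizontal band per species and overlap heavily. The plot shows that each species covers a different range of bill depths, but it is hard to judge how the values are distributed within each band. A geom that summarises the distribution for each species, such as a boxplot, would be a better choice.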
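
As a sketch of one alternative geom, the same mapping can be passed to geom_boxplot(), which summarises the bill depths of each species instead of drawing every point. The name p_box is used only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Same aesthetics as the scatterplot, but summarised per species
p_box <- ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = species
  )
) +
  geom_boxplot(na.rm = TRUE)

p_box
```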

  1. Why does the following give an error and how would you fix it?
ggplot(data = penguins) + 
  geom_point()

It gives an error because geom_point() requires the x and y aesthetics, and none have been mapped. To fix it, map variables to x and y, either globally in ggplot() or locally in geom_point().
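
A minimal sketch of a fix follows. The choice of flipper_length_mm and body_mass_g is an assumption made for illustration, since the exercise does not say which variables to plot.

```r
library(ggplot2)
library(palmerpenguins)

# Supplying x and y in the global mapping gives geom_point() what it needs
p_fixed <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

p_fixed
```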

  1. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
ggplot(
  penguins,
  aes(x = flipper_length_mm,
      y = bill_depth_mm)
) +
  geom_point(na.rm = TRUE)

The na.rm argument controls how missing values (NA) are handled. With the default, na.rm = FALSE, rows with missing values are removed with a warning. With na.rm = TRUE, they are removed silently.
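
To see how many rows are affected, you can count the rows in which either plotted variable is missing. This is a small sketch; n_missing is a name introduced here for illustration.

```r
library(palmerpenguins)

# Rows where at least one of the two plotted variables is NA
n_missing <- sum(
  is.na(penguins$flipper_length_mm) | is.na(penguins$bill_depth_mm)
)

n_missing
```

This count should match the number of rows reported in the warning that geom_point() gives when na.rm is left at FALSE.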

  1. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
# ggplot function defines data source and global mapping attributes
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  # geom functions define the plot type and local mapping attributes
  geom_point(
    mapping = aes(
      colour = species,
      shape = species
    )
  ) +
  geom_smooth(method = "lm") +
  # labs adds labels
  labs(
    title = "Flipper length and body mass",
    subtitle = "Dimensions for Adelie, Chinstrap and Gentoo species",
    caption = "Data come from the palmerpenguins package.",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    shape = "Species",
    colour = "Species"
  ) +
  # colour theme from ggthemes package
  scale_color_colorblind()

  1. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
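ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  # colour is mapped locally so that it applies only to the points
  geom_point(
    aes(
      colour = bill_depth_mm
    )
  ) +
  geom_smooth(
    method = "gam"
  )

bill_depth_mm should be mapped to the colour aesthetic, at the geom level in geom_point(), so that the points are coloured by bill depth while geom_smooth() draws a single curve.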

  1. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

  1. Will these two graphs look different? Why/why not?
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
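
One way to check a prediction here is to compare the data that ggplot2 computes for each layer of the two plots. This is a sketch using layer_data() from ggplot2; the names p_global, p_local, same_points and same_smooth are introduced for illustration. Both plots use the same data and the same mappings, so the expectation is that the layers match.

```r
library(ggplot2)
library(palmerpenguins)

# Data and mapping declared once, globally
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

# Data and mapping repeated in each geom
p_local <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

# Small helper that silences the missing-value warnings and smoothing messages
quiet_layer <- function(plot, i) {
  suppressMessages(suppressWarnings(layer_data(plot, i)))
}

# Compare the computed point layer and the computed smooth layer
same_points <- all.equal(quiet_layer(p_global, 1), quiet_layer(p_local, 1))
same_smooth <- all.equal(quiet_layer(p_global, 2), quiet_layer(p_local, 2))

same_points
same_smooth
```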

1.4 Visualising distributions

How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.

Categorical variable

Use a bar chart.

penguins |> 
  ggplot(
    aes(
      x = species
    )
  ) +
  geom_bar()

We can also reorder the bars based on their frequencies by transforming the variable to a factor and reordering the levels of the factor.

penguins |> 
  ggplot(
    aes(
      x = fct_infreq(species)
    )
  ) +
  geom_bar()
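
To see what fct_infreq() changes, compare the factor levels before and after reordering. This is a short sketch; reordered is a name introduced for illustration.

```r
library(forcats)
library(palmerpenguins)

# Original level order (alphabetical)
levels(penguins$species)

# Levels reordered from most to least frequent
reordered <- fct_infreq(penguins$species)
levels(reordered)

# Counts per species, for reference
table(penguins$species)
```

Only the order of the levels changes; the values themselves are untouched, which is why the bars are rearranged but their heights stay the same.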

Numerical variable

A variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete.

One commonly used visualization for distributions of continuous variables is a histogram.

penguins |> 
  ggplot(
    aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 200
  )

An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution.

Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.

penguins |> 
  ggplot(
    aes(x = body_mass_g)
  ) +
  geom_density()
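
To compare the two views directly, the histogram can be drawn on the density scale and the density curve laid over it. This is a sketch that uses after_stat(density) to rescale the histogram; the binwidth of 200 is carried over from the earlier example and p_overlay is a name introduced for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_overlay <- ggplot(penguins, aes(x = body_mass_g)) +
  # Rescale bar heights from counts to densities so both layers share a y-axis
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 200,
    na.rm = TRUE
  ) +
  # Smoothed version of the same distribution
  geom_density(colour = "red", na.rm = TRUE)

p_overlay
```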

Exercises

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
penguins |> 
  ggplot(
    aes(y = species)
  ) +
  geom_bar()

  1. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

On a bar plot, the fill aesthetic is more useful for changing the colour of the bars. The color aesthetic only changes the border of the bars, whereas the fill aesthetic changes the whole bar colour.

  1. What does the bins argument in geom_histogram() do?
penguins |> 
  ggplot(
    aes(x = body_mass_g)
  ) +
  geom_histogram(
    bins = 25
  )

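The bins argument sets the number of bins (bars) that the range of the variable is divided into. The default is 30, and bins is overridden if binwidth is also supplied.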
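
The effect can be inspected by building the same histogram with different values of bins and counting the rows of computed bin data. This is a sketch; the values 10 and 50 and the names p_coarse and p_fine are chosen only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_coarse <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 10, na.rm = TRUE)

p_fine <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 50, na.rm = TRUE)

# layer_data() returns one row per computed bin
n_coarse <- nrow(layer_data(p_coarse))
n_fine <- nrow(layer_data(p_fine))

n_coarse
n_fine
```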

  1. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
diamonds |> 
  ggplot(
    aes(x = carat)
  ) +
  geom_histogram(
    binwidth = 0.01
  )

With a binwidth of 0.01, the histogram reveals many distinct peaks (modes) in carat that a wider binwidth would smooth over.

1.5 Visualising relationships

To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.

A numerical and a categorical variable

Use a boxplot.

penguins |> 
  ggplot(
    aes(
      x = species,
      y = body_mass_g
    )
  ) +
  geom_boxplot()

Alternatively, you could use a density plot.

penguins |> 
  ggplot(
    aes(
      x = body_mass_g,
      colour = species,
      fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  )

Two categorical variables

Use stacked bar plots.

penguins |> 
  ggplot(
    aes(
      x = island,
      fill = species
    )
  ) +
  geom_bar()

Alternatively, use a relative frequency plot.

penguins |> 
  ggplot(
    aes(
      x = island,
      fill = species
    )
  ) +
  geom_bar(
    position = "fill"
  )
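
The proportions that position = "fill" displays can also be computed directly, which is useful when you want the numbers behind the bars. This is a sketch using dplyr; island_props and prop are names introduced for illustration.

```r
library(dplyr)
library(palmerpenguins)

island_props <- penguins |>
  # Number of penguins for each island and species combination
  count(island, species) |>
  # Share of each species within its island
  group_by(island) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

island_props
```

Within each island the prop values sum to 1, which corresponds to every bar in the relative frequency plot having the same height.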

Two numerical variables

Use a scatter plot.

penguins |> 
  ggplot(
    aes(
      x = flipper_length_mm,
      y = body_mass_g
    )
  ) +
  geom_point()

Three or more variables

We can incorporate more variables into the plot by mapping them to additional aesthetics (e.g. colour).

penguins |> 
  ggplot(
    aes(
      x = flipper_length_mm,
      y = body_mass_g
    )
  ) +
  geom_point(
    aes(
      colour = species,
      shape = island
    )
  )

However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another approach, which is particularly useful for categorical variables, is to split your plot into facets: subplots that each display one subset of the data.

penguins |> 
  ggplot(
    aes(
      x = flipper_length_mm,
      y = body_mass_g
    )
  ) +
  geom_point(
    aes(
      colour = species
    )
  ) +
  facet_wrap(~island)

Exercises

  1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

manufacturer, model, trans, drv, fl and class are categorical. displ, year, cyl, cty and hwy are numerical.
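
The same split can be derived from the column classes rather than read off the printed output. This is a sketch that treats character columns as categorical and integer or double columns as numerical, which is an assumption: variables such as year and cyl are stored as integers but could reasonably be treated as categories.

```r
library(ggplot2)

# Class of every column in mpg
col_types <- sapply(mpg, class)

# Character columns, treated here as categorical
categorical <- names(col_types)[col_types == "character"]

# Integer and double columns, treated here as numerical
numerical <- names(col_types)[col_types %in% c("integer", "numeric")]

categorical
numerical
```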

  1. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
mpg |> 
  ggplot(
    aes(
      x = displ,
      y = hwy,
      colour = drv,
      size = cty,
      shape = fl
    )
  ) +
  geom_point()

You cannot map a continuous variable to the shape aesthetic. When a numerical variable is mapped to colour it takes on a gradient palette but when a categorical variable is mapped to colour it takes on a palette of distinct colours.
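
The difference in colour scales is easiest to see side by side. This sketch maps first a numerical variable (cty) and then a categorical variable (drv) to colour in otherwise identical plots; p_numeric and p_categorical are names introduced for illustration.

```r
library(ggplot2)

# Numerical variable on colour: drawn with a continuous gradient
p_numeric <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point()

# Categorical variable on colour: drawn with distinct colours per level
p_categorical <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

p_numeric
p_categorical
```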

  1. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
mpg |> 
  ggplot(
    aes(
      x = displ,
      y = hwy,
      linewidth = drv
    )
  ) +
  geom_point()

Nothing happens - there is no line to alter the width of, so the code runs as if it wasn’t there.

  1. What happens if you map the same variable to multiple aesthetics?
mpg |> 
  ggplot(
    aes(
      x = displ,
      y = cty,
      colour = manufacturer,
      shape = manufacturer
    )
  ) +
  geom_point()

  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
penguins |> 
  ggplot(
    aes(
      x = bill_length_mm,
      y = bill_depth_mm
    )
  ) +
  geom_point(
    aes(
      colour = species
    )
  )

penguins |> 
  ggplot(
    aes(
      x = bill_length_mm,
      y = bill_depth_mm
    )
  ) +
  geom_point() +
  facet_wrap(~species)

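Colouring by species reveals a distinct cluster of points for each species, and within each cluster the relationship appears positive. Faceting by species shows the same pattern with each species in its own panel, which makes the within-species trend easier to see but the species harder to compare directly.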

  1. Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species", shape = "Species")

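The original code yields two legends because only colour is given the label "Species" in labs(), so the colour and shape legends have different titles and cannot be merged. The version above fixes this by giving shape the same label in labs().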
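
For comparison, here is a sketch of the two-legend situation that this answer describes. It assumes the exercise's code labelled only the colour aesthetic, as the explanation implies; p_two_legends is a name introduced for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_two_legends <- ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point(na.rm = TRUE) +
  # Only colour is relabelled, so its legend title no longer matches shape's
  labs(color = "Species")

p_two_legends
```

ggplot2 merges legends only when they share the same title and labels, so relabelling both aesthetics identically is what combines them.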

  1. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")